Step 1: Clean and prepare your data: There are several entries where values have been deleted to simulate dirty data. Please clean the data with whatever method(s) you believe is best/most suitable. Note that some of the missing values are truly blank (unknown answers) and thus may be impossible to clean; use your discretion.
Step 2: Build your models: Please build machine learning/statistical models in Python to predict the interest rate assigned to a loan. When writing the code associated with each model, please have the first part produce and save the model, followed by a second part that loads and applies the model.
Step 3: Test your models using the data found within the "Holdout for Testing" file. Save the results of your final model (remember: you will predict only the first column of the holdout test set, using your best model) in a single, separate CSV titled "Results from " followed by your name or UChicago net ID.
Step 4: Submit your work: Please submit all of your code for cleaning, prepping, and modeling the data; your "Results" file; and a brief write-up comparing the pros and cons of the modeling techniques you used (no more than a paragraph). Your work will be scored on the techniques used (appropriateness and complexity), model performance on the holdout data (measured by RMSE), your understanding of the techniques compared in your write-up, and your overall code.
Data Dictionary
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
# Importing classifier modules (exploratory only; the task is regression, and the regressors actually used are imported in section 03-2)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
plt.style.use("ggplot")
%load_ext lab_black
# import warnings
# warnings.filterwarnings("ignore")
df = pd.read_csv("Data for Cleaning & Modeling.csv")
df.head(3)
/var/folders/08/336_pmn17t56r5jjv51n_8840000gn/T/ipykernel_64897/2781832233.py:1: DtypeWarning: Columns (15) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("Data for Cleaning & Modeling.csv")
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | X24 | X25 | X26 | X27 | X28 | X29 | X30 | X31 | X32 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89% | 54734.0 | 80364.0 | $25,000 | $25,000 | $19,080 | 36 months | B | B4 | NaN | < 1 year | RENT | 85000.0 | VERIFIED - income | Aug-09 | Due to a lack of personal finance education an... | debt_consolidation | Debt consolidation for on-time payer | 941xx | CA | 19.48 | 0.0 | Feb-94 | 0.0 | NaN | NaN | 10.0 | 0.0 | 28854.0 | 52.10% | 42.0 | f |
| 1 | 10.71% | 55742.0 | 114426.0 | $7,000 | $7,000 | $673 | 36 months | B | B5 | CNN | < 1 year | RENT | 65000.0 | not verified | May-08 | Just want to pay off the last bit of credit ca... | credit_card | Credit Card payoff | 112xx | NY | 14.29 | 0.0 | Oct-00 | 0.0 | NaN | NaN | 7.0 | 0.0 | 33623.0 | 76.70% | 7.0 | f |
| 2 | 16.99% | 57167.0 | 137225.0 | $25,000 | $25,000 | $24,725 | 36 months | D | D3 | Web Programmer | 1 year | RENT | 70000.0 | VERIFIED - income | Aug-14 | Trying to pay a friend back for apartment brok... | debt_consolidation | mlue | 100xx | NY | 10.50 | 0.0 | Jun-00 | 0.0 | 41.0 | NaN | 10.0 | 0.0 | 19878.0 | 66.30% | 17.0 | f |
# df.isna().sum()
# For Data Exploration
# df.X__.value_counts()
# df.X__.value_counts().count()
# df.X__.unique()
df.X12.value_counts()
MORTGAGE    172112
RENT        136778
OWN          29588
OTHER          124
NONE            36
ANY              1
Name: X12, dtype: int64
columns_subset = [
"X1",
"X4",
"X5",
"X6",
"X7",
"X8",
"X11",
"X12",
"X13",
"X14",
"X15",
"X17",
"X20",
"X21",
"X22",
"X23",
"X24",
"X25",
"X26",
"X27",
"X28",
"X29",
"X30",
"X31",
"X32",
]
# 'X2', 'X3' - IDs, not useful for prediction
# 'X9' - Decided to use X8 instead of X9 (more information)
# 'X10' - Employer or job title (self-reported), relatively unclear with 23,986 missing records; better dropped
# 'X16' (Reason for loan) - Too messy, with too many missing records; excluded for now
# 'X18' - Self-reported and similar to 'X17'; dropped in favor of 'X17'
# 'X19' - The first 3 digits of the zip code add little beyond state; dropped in favor of 'X20'
df = pd.read_csv("Data for Cleaning & Modeling.csv", usecols=columns_subset)
column_rename = {
"X1": "Interest_Rate",
"X4": "Loan_Requested",
"X5": "Loan_Funded",
"X6": "Investor_portion",
"X7": "Number_of_Payments",
"X8": "Loan_Grade",
"X11": "Years_Employed",
"X12": "Home_Ownership",
"X13": "Annual_Income",
"X14": "Verification",
"X15": "Issued_MonYY",
"X17": "Loan_Category",
"X20": "State",
"X21": "Debit_Payment_Ratio",
"X22": "Delinquency_Frequency",
"X23": "Credit_Earliest_MonYY",
"X24": "Inquiries_Count",
"X25": "Delinquency_Recency",
"X26": "Public_Record_Recency",
"X27": "Open_Credit_Line_Count",
"X28": "Derogatory_Public_Record_Count",
"X29": "Total_Credit",
"X30": "Credit_Utilization_Rate",
"X31": "Total_Credit_Line_Count",
"X32": "Initial_Status",
}
df = df.rename(columns=column_rename)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 25 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   Interest_Rate                   338990 non-null  object
 1   Loan_Requested                  399999 non-null  object
 2   Loan_Funded                     399999 non-null  object
 3   Investor_portion                399999 non-null  object
 4   Number_of_Payments              399999 non-null  object
 5   Loan_Grade                      338730 non-null  object
 6   Years_Employed                  382462 non-null  object
 7   Home_Ownership                  338639 non-null  object
 8   Annual_Income                   338972 non-null  float64
 9   Verification                    399999 non-null  object
 10  Issued_MonYY                    399999 non-null  object
 11  Loan_Category                   399999 non-null  object
 12  State                           399999 non-null  object
 13  Debit_Payment_Ratio             399999 non-null  float64
 14  Delinquency_Frequency           399999 non-null  float64
 15  Credit_Earliest_MonYY           399999 non-null  object
 16  Inquiries_Count                 399999 non-null  float64
 17  Delinquency_Recency             181198 non-null  float64
 18  Public_Record_Recency           51155 non-null   float64
 19  Open_Credit_Line_Count          399999 non-null  float64
 20  Derogatory_Public_Record_Count  399999 non-null  float64
 21  Total_Credit                    399999 non-null  float64
 22  Credit_Utilization_Rate         399733 non-null  object
 23  Total_Credit_Line_Count         399999 non-null  float64
 24  Initial_Status                  399999 non-null  object
dtypes: float64(10), object(15)
memory usage: 76.3+ MB
df.head(3)
| Interest_Rate | Loan_Requested | Loan_Funded | Investor_portion | Number_of_Payments | Loan_Grade | Years_Employed | Home_Ownership | Annual_Income | Verification | Issued_MonYY | Loan_Category | State | Debit_Payment_Ratio | Delinquency_Frequency | Credit_Earliest_MonYY | Inquiries_Count | Delinquency_Recency | Public_Record_Recency | Open_Credit_Line_Count | Derogatory_Public_Record_Count | Total_Credit | Credit_Utilization_Rate | Total_Credit_Line_Count | Initial_Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89% | $25,000 | $25,000 | $19,080 | 36 months | B | < 1 year | RENT | 85000.0 | VERIFIED - income | Aug-09 | debt_consolidation | CA | 19.48 | 0.0 | Feb-94 | 0.0 | NaN | NaN | 10.0 | 0.0 | 28854.0 | 52.10% | 42.0 | f |
| 1 | 10.71% | $7,000 | $7,000 | $673 | 36 months | B | < 1 year | RENT | 65000.0 | not verified | May-08 | credit_card | NY | 14.29 | 0.0 | Oct-00 | 0.0 | NaN | NaN | 7.0 | 0.0 | 33623.0 | 76.70% | 7.0 | f |
| 2 | 16.99% | $25,000 | $25,000 | $24,725 | 36 months | D | 1 year | RENT | 70000.0 | VERIFIED - income | Aug-14 | debt_consolidation | NY | 10.50 | 0.0 | Jun-00 | 0.0 | 41.0 | NaN | 10.0 | 0.0 | 19878.0 | 66.30% | 17.0 | f |
02-1. Some simple datatype changes
# Data type - From "String" To "Float"
## Rate - remove % and change to float type
for col in ["Interest_Rate", "Credit_Utilization_Rate"]:
    df[col] = df[col].str.slice(stop=-1).astype("float")
## Dollar Amount - remove $ and "," and change to float type
for col in ["Loan_Requested", "Loan_Funded", "Investor_portion"]:
    df[col] = df[col].str.slice(start=1).str.replace(",", "").astype("float")
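As a quick sanity check, the same string-to-float cleanup can be exercised on a small hand-made Series (toy values in the same "11.89%" / "$25,000" formats, not rows from the dataset):

```python
import pandas as pd

# Toy examples mirroring the raw formats: "11.89%" and "$25,000"
rates = pd.Series(["11.89%", "52.10%"])
amounts = pd.Series(["$25,000", "$7,000"])

# Strip the trailing "%" and cast to float
rates_clean = rates.str.slice(stop=-1).astype("float")

# Strip the leading "$", drop thousands separators, cast to float
amounts_clean = amounts.str.slice(start=1).str.replace(",", "").astype("float")
```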
# Data type - To Category
## Number of Payments - only 2: 36 months/ 60 months
# Change to a 'Payments_is_36' column with 2 values: (1, 0)
# Note: strip the padding first - the raw values are " 36 months" / " 60 months",
# so comparing against " 36months" would mark every loan as 0
df["Payments_is_36"] = (
    df["Number_of_Payments"].str.strip() == "36 months"
).astype(int)
df = df.drop("Number_of_Payments", axis=1)
## Simply change to Category
Category = [
    # "Loan_Grade" - needs further processing
    "Years_Employed",
    # "Home_Ownership" - needs further processing
    "Verification",
    "Loan_Category",
    "State",
    "Initial_Status",
]
for col in Category:
    df[col] = df[col].astype("category")
02-2. Explore the data types and missing values
# df.info()
df.isna().sum()
Interest_Rate                      61010
Loan_Requested                         1
Loan_Funded                            1
Investor_portion                       1
Loan_Grade                         61270
Years_Employed                     17538
Home_Ownership                     61361
Annual_Income                      61028
Verification                           1
Issued_MonYY                           1
Loan_Category                          1
State                                  1
Debit_Payment_Ratio                    1
Delinquency_Frequency                  1
Credit_Earliest_MonYY                  1
Inquiries_Count                        1
Delinquency_Recency               218802
Public_Record_Recency             348845
Open_Credit_Line_Count                 1
Derogatory_Public_Record_Count         1
Total_Credit                           1
Credit_Utilization_Rate              267
Total_Credit_Line_Count                1
Initial_Status                         1
Payments_is_36                         0
dtype: int64
df[df["Loan_Requested"].isna()]
| Interest_Rate | Loan_Requested | Loan_Funded | Investor_portion | Loan_Grade | Years_Employed | Home_Ownership | Annual_Income | Verification | Issued_MonYY | Loan_Category | State | Debit_Payment_Ratio | Delinquency_Frequency | Credit_Earliest_MonYY | Inquiries_Count | Delinquency_Recency | Public_Record_Recency | Open_Credit_Line_Count | Derogatory_Public_Record_Count | Total_Credit | Credit_Utilization_Rate | Total_Credit_Line_Count | Initial_Status | Payments_is_36 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 364111 | 7.69 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
The entry above is almost entirely NA, so it should be dropped.
df = df[~df["Loan_Requested"].isna()]
df.head(3)
| Interest_Rate | Loan_Requested | Loan_Funded | Investor_portion | Loan_Grade | Years_Employed | Home_Ownership | Annual_Income | Verification | Issued_MonYY | Loan_Category | State | Debit_Payment_Ratio | Delinquency_Frequency | Credit_Earliest_MonYY | Inquiries_Count | Delinquency_Recency | Public_Record_Recency | Open_Credit_Line_Count | Derogatory_Public_Record_Count | Total_Credit | Credit_Utilization_Rate | Total_Credit_Line_Count | Initial_Status | Payments_is_36 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89 | 25000.0 | 25000.0 | 19080.0 | B | < 1 year | RENT | 85000.0 | VERIFIED - income | Aug-09 | debt_consolidation | CA | 19.48 | 0.0 | Feb-94 | 0.0 | NaN | NaN | 10.0 | 0.0 | 28854.0 | 52.1 | 42.0 | f | 0 |
| 1 | 10.71 | 7000.0 | 7000.0 | 673.0 | B | < 1 year | RENT | 65000.0 | not verified | May-08 | credit_card | NY | 14.29 | 0.0 | Oct-00 | 0.0 | NaN | NaN | 7.0 | 0.0 | 33623.0 | 76.7 | 7.0 | f | 0 |
| 2 | 16.99 | 25000.0 | 25000.0 | 24725.0 | D | 1 year | RENT | 70000.0 | VERIFIED - income | Aug-14 | debt_consolidation | NY | 10.50 | 0.0 | Jun-00 | 0.0 | 41.0 | NaN | 10.0 | 0.0 | 19878.0 | 66.3 | 17.0 | f | 0 |
02-3. Dealing with Mon-YY Column
# 'Issued_MonYY', 'Credit_Earliest_MonYY'
# Convert both columns to timestamps and subtract them from today's timestamp to get the elapsed time since each record, as a numeric feature
from datetime import datetime
now_timestamp = datetime.timestamp(datetime.now())
df["Issued_MonYY"] = df["Issued_MonYY"].apply(
lambda x: datetime.strptime(str(x), "%b-%y")
)
df["Issued_MonYY"] = df["Issued_MonYY"].apply(lambda x: datetime.timestamp(x))
df["Since_Issued"] = now_timestamp - df["Issued_MonYY"]
df["Credit_Earliest_MonYY"] = df["Credit_Earliest_MonYY"].apply(
lambda x: datetime.strptime(str(x), "%b-%y")
)
df["Credit_Earliest_MonYY"] = df["Credit_Earliest_MonYY"].apply(
lambda x: datetime.timestamp(x)
)
df["Since_First_Crdit"] = now_timestamp - df["Credit_Earliest_MonYY"]
df = df.drop(["Issued_MonYY", "Credit_Earliest_MonYY"], axis=1)
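A vectorized alternative to the per-row `apply` above (a sketch on toy values, not the notebook's code) is `pd.to_datetime` with an explicit format string:

```python
import pandas as pd

# Toy "Mon-YY" values in the same format as Issued_MonYY / Credit_Earliest_MonYY
s = pd.Series(["Aug-09", "Feb-94"])

# Parse the whole column at once; %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999
parsed = pd.to_datetime(s, format="%b-%y")

# Elapsed seconds since each date, as a numeric column
since = (pd.Timestamp.now() - parsed).dt.total_seconds()
```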
# df.info()
# df.head()
# df.isna().sum()
02-4. Impute the missing values for both "Delinquency_Recency" and "Public_Record_Recency"
# np.sort(df.Delinquency_Recency.unique())
# min: 0 , max: 188
# We assume NA means there has never been a delinquency, so we replace NA with the extreme value 999.
df["Delinquency_Recency"] = df.Delinquency_Recency.fillna(999)
# df.Delinquency_Recency.isna().sum()
# np.sort(df.Public_Record_Recency.unique())
# min: 0 , max: 129
# We assume NA means there has never been a public record, so we replace NA with the extreme value 999.
df["Public_Record_Recency"] = df.Public_Record_Recency.fillna(999)
# df.Public_Record_Recency.isna().sum()
# Since Interest_Rate is our target variable, we need valid value to train our model
# We drop whichever records with NA in Interest_Rate
df = df[~df["Interest_Rate"].isna()]
df.isna().sum()
Interest_Rate                          0
Loan_Requested                         0
Loan_Funded                            0
Investor_portion                       0
Loan_Grade                         51866
Years_Employed                     14794
Home_Ownership                     51959
Annual_Income                      51751
Verification                           0
Loan_Category                          0
State                                  0
Debit_Payment_Ratio                    0
Delinquency_Frequency                  0
Inquiries_Count                        0
Delinquency_Recency                    0
Public_Record_Recency                  0
Open_Credit_Line_Count                 0
Derogatory_Public_Record_Count         0
Total_Credit                           0
Credit_Utilization_Rate              224
Total_Credit_Line_Count                0
Initial_Status                         0
Payments_is_36                         0
Since_Issued                           0
Since_First_Crdit                      0
dtype: int64
02-5. Change the "Years_Employed" from categorical to numerical and impute the missing values using interpolate()
# Years_Employed 14794
sns.histplot(df[df["Years_Employed"].notnull()]["Years_Employed"])
plt.show()
# Change df["Years_Employed"] to float
df["Years_Employed"] = df["Years_Employed"].apply(
lambda x: x.replace("< 1", "0").replace("+", "").strip()[:2]
)
df["Years_Employed"] = df["Years_Employed"].astype("float")
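An equivalent and arguably more robust mapping (a sketch on toy labels) extracts the digits directly instead of slicing the first two characters:

```python
import pandas as pd

s = pd.Series(["< 1 year", "1 year", "3 years", "10+ years"])

# "< 1 year" becomes 0; otherwise keep the leading number ("10+ years" -> 10)
years = (
    s.str.replace("< 1", "0", regex=False)
    .str.extract(r"(\d+)")[0]
    .astype("float")
)
```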
df["Years_Employed"].describe()
count    324195.000000
mean          5.991567
std           3.613908
min           0.000000
25%           3.000000
50%           6.000000
75%          10.000000
max          10.000000
Name: Years_Employed, dtype: float64
# Fill NaN values using an interpolation method
df["Years_Employed"] = df["Years_Employed"].interpolate()
# Check again the distribution using .describe()
df["Years_Employed"].describe()
count    338989.000000
mean          5.996765
std           3.574865
min           0.000000
25%           3.000000
50%           6.000000
75%          10.000000
max          10.000000
Name: Years_Employed, dtype: float64
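One caveat with `interpolate()`: it fills each gap linearly from the values immediately before and after it in row order, so on a column like this it behaves as a local average of unrelated neighbouring rows. A minimal illustration of that behaviour:

```python
import pandas as pd

s = pd.Series([2.0, None, 4.0, None])

# Linear interpolation: the interior NaN becomes the midpoint of its
# neighbours; the trailing NaN is forward-filled from the last known value
filled = s.interpolate()
```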
02-6. For "Home_Ownership" column, assign "NONE", "ANY" and NaN into the "OTHER" Category.
# Home ownership status: RENT, OWN, MORTGAGE, OTHER
# df["Home_Ownership"].unique(): ['RENT', 'OWN', 'MORTGAGE', 'NONE', NaN, 'OTHER', 'ANY']
# Since there should be only 4 statuses, we fold "NONE" and "ANY" into "OTHER"
df["Home_Ownership"] = (
df["Home_Ownership"].str.replace("NONE", "OTHER").str.replace("ANY", "OTHER")
)
## Home_Ownership (NaN) 51959
### Assign NA into group 'other'
df["Home_Ownership"] = df["Home_Ownership"].fillna("OTHER")
df["Home_Ownership"] = df["Home_Ownership"].astype("category")
df["Home_Ownership"].value_counts()
MORTGAGE    145958
RENT        115958
OTHER        52097
OWN          24976
Name: Home_Ownership, dtype: int64
02-7. Change 'State' to 5 'Region's: 'West', 'Northeast', 'South', 'Midwest', 'Other'
states = {
"AK": "Other",
"AL": "South",
"AR": "South",
"AS": "Other",
"AZ": "West",
"CA": "West",
"CO": "West",
"CT": "Northeast",
"DC": "Northeast",
"DE": "Northeast",
"FL": "South",
"GA": "South",
"GU": "Other",
"HI": "Other",
"IA": "Midwest",
"ID": "West",
"IL": "Midwest",
"IN": "Midwest",
"KS": "Midwest",
"KY": "South",
"LA": "South",
"MA": "Northeast",
"MD": "Northeast",
"ME": "Northeast",
"MI": "Midwest",
"MN": "Midwest",
"MO": "Midwest",
"MP": "Other",
"MS": "South",
"MT": "West",
"NA": "Other",
"NC": "South",
"ND": "Midwest",
"NE": "Midwest",
"NH": "Northeast",
"NJ": "Northeast",
"NM": "West",
"NV": "West",
"NY": "Northeast",
"OH": "Midwest",
"OK": "South",
"OR": "West",
"PA": "Northeast",
"PR": "Other",
"RI": "Northeast",
"SC": "South",
"SD": "Midwest",
"TN": "South",
"TX": "South",
"UT": "West",
"VA": "South",
"VI": "Other",
"VT": "Northeast",
"WA": "West",
"WI": "Midwest",
"WV": "South",
"WY": "West",
}
df["Region"] = df["State"].map(states)
df = df.drop("State", axis=1)
df["Region"] = df["Region"].astype("category")
02-8. Remove outliers in "Annual_Income" and impute missing value using interpolate()
## Annual_Income (NaN) 51751
df["Annual_Income"].describe()
count    2.872380e+05
mean     7.315146e+04
std      5.618967e+04
min      3.000000e+03
25%      4.500000e+04
50%      6.300000e+04
75%      8.807875e+04
max      7.500000e+06
Name: Annual_Income, dtype: float64
fig = px.box(df, x="Annual_Income")
fig.show()
# Hover over the box to view the box plot values
# Upper fence: 152.606k
# Outliers: > 152,606
df = df[(df["Annual_Income"] <= 152_606) | (df["Annual_Income"].isna())]
df["Annual_Income"].describe()
count    275048.000000
mean      66397.971323
std       29090.473306
min        3000.000000
25%       45000.000000
50%       60000.000000
75%       84000.000000
max      152606.000000
Name: Annual_Income, dtype: float64
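The 152,606 cutoff above was read off the interactive box plot. The same Tukey upper fence (Q3 + 1.5 × IQR, the rule box plots use) can be computed programmatically instead of hardcoded; a sketch on toy values:

```python
import pandas as pd

# Toy income values with one extreme outlier
income = pd.Series([30_000, 45_000, 60_000, 88_000, 500_000], dtype="float")

# Tukey upper fence: Q3 + 1.5 * IQR
q1, q3 = income.quantile(0.25), income.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Keep rows under the fence, plus NaNs for later imputation
kept = income[(income <= upper_fence) | (income.isna())]
```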
# Fill NaN values using an interpolation method
df["Annual_Income"] = df["Annual_Income"].interpolate()
# Check again the distribution using .describe()
df["Annual_Income"].describe()
count    326799.000000
mean      66398.567132
std       27956.946520
min        3000.000000
25%       45000.000000
50%       61500.000000
75%       83000.000000
max      152606.000000
Name: Annual_Income, dtype: float64
02-9. Remove outliers in "Credit_Utilization_Rate" and impute missing value using interpolate()
## Credit_Utilization_Rate 224
df["Credit_Utilization_Rate"].describe()
count    326599.000000
mean         56.245653
std          23.687279
min           0.000000
25%          39.500000
50%          57.800000
75%          74.800000
max         892.300000
Name: Credit_Utilization_Rate, dtype: float64
fig = px.box(df, y="Credit_Utilization_Rate")
fig.show()
# Hover over the box to view the box plot values
# Upper fence: 127.4
# Outliers: > 127.4
# df[df["Credit_Utilization_Rate"] > 127.4] : 9 entries
df = df[
(df["Credit_Utilization_Rate"] <= 127.4) | (df["Credit_Utilization_Rate"].isna())
]
sns.displot(df["Credit_Utilization_Rate"], kde=True)
plt.show()
# Fill NaN values using an interpolation method
df["Credit_Utilization_Rate"] = df["Credit_Utilization_Rate"].interpolate()
# Check again the distribution using .describe()
sns.displot(df["Credit_Utilization_Rate"], kde=True)
plt.show()
02-10. Assign the missing values into a new loan grade called "O", as in "Others"
df["Loan_Grade"].unique()
array(['B', 'D', 'C', 'A', 'E', 'F', nan, 'G'], dtype=object)
## Loan_Grade (NaN) 51866
### Assign NA into a new group called 'O', Others.
df["Loan_Grade"] = df["Loan_Grade"].fillna("O")
df["Loan_Grade"] = df["Loan_Grade"].astype("category")
df.isna().sum()
# No Missing Values!
Interest_Rate                     0
Loan_Requested                    0
Loan_Funded                       0
Investor_portion                  0
Loan_Grade                        0
Years_Employed                    0
Home_Ownership                    0
Annual_Income                     0
Verification                      0
Loan_Category                     0
Debit_Payment_Ratio               0
Delinquency_Frequency             0
Inquiries_Count                   0
Delinquency_Recency               0
Public_Record_Recency             0
Open_Credit_Line_Count            0
Derogatory_Public_Record_Count    0
Total_Credit                      0
Credit_Utilization_Rate           0
Total_Credit_Line_Count           0
Initial_Status                    0
Payments_is_36                    0
Since_Issued                      0
Since_First_Crdit                 0
Region                            0
dtype: int64
02-12. Create dummy variables for categorical variables.
df = pd.get_dummies(df, drop_first=True)
df.head()
| Interest_Rate | Loan_Requested | Loan_Funded | Investor_portion | Years_Employed | Annual_Income | Debit_Payment_Ratio | Delinquency_Frequency | Inquiries_Count | Delinquency_Recency | Public_Record_Recency | Open_Credit_Line_Count | Derogatory_Public_Record_Count | Total_Credit | Credit_Utilization_Rate | Total_Credit_Line_Count | Payments_is_36 | Since_Issued | Since_First_Crdit | Loan_Grade_B | Loan_Grade_C | Loan_Grade_D | Loan_Grade_E | Loan_Grade_F | Loan_Grade_G | Loan_Grade_O | Home_Ownership_OTHER | Home_Ownership_OWN | Home_Ownership_RENT | Verification_VERIFIED - income source | Verification_not verified | Loan_Category_credit_card | Loan_Category_debt_consolidation | Loan_Category_educational | Loan_Category_home_improvement | Loan_Category_house | Loan_Category_major_purchase | Loan_Category_medical | Loan_Category_moving | Loan_Category_other | Loan_Category_renewable_energy | Loan_Category_small_business | Loan_Category_vacation | Loan_Category_wedding | Initial_Status_w | Region_Northeast | Region_Other | Region_South | Region_West | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89 | 25000.0 | 25000.0 | 19080.0 | 0.0 | 85000.0 | 19.48 | 0.0 | 0.0 | 999.0 | 999.0 | 10.0 | 0.0 | 28854.0 | 52.1 | 42.0 | 0 | 4.271869e+08 | 9.162073e+08 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 10.71 | 7000.0 | 7000.0 | 673.0 | 0.0 | 65000.0 | 14.29 | 0.0 | 0.0 | 999.0 | 999.0 | 7.0 | 0.0 | 33623.0 | 76.7 | 7.0 | 0 | 4.666717e+08 | 7.059133e+08 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 16.99 | 25000.0 | 25000.0 | 24725.0 | 1.0 | 70000.0 | 10.50 | 0.0 | 0.0 | 41.0 | 999.0 | 10.0 | 0.0 | 19878.0 | 66.3 | 17.0 | 0 | 2.694205e+08 | 7.164541e+08 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 13.11 | 1200.0 | 1200.0 | 1200.0 | 10.0 | 54000.0 | 5.47 | 0.0 | 0.0 | 64.0 | 999.0 | 5.0 | 0.0 | 2584.0 | 40.4 | 31.0 | 0 | 4.088665e+08 | 1.202882e+09 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 13.57 | 10800.0 | 10800.0 | 10692.0 | 6.0 | 32000.0 | 11.63 | 0.0 | 1.0 | 58.0 | 999.0 | 14.0 | 0.0 | 3511.0 | 25.6 | 40.0 | 0 | 4.192381e+08 | 8.268697e+08 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
df.shape
(326790, 49)
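One caveat with `pd.get_dummies`: the holdout file must end up with exactly these feature columns, in the same order, or the fitted model's inputs will be misaligned (for example, the holdout has no missing grades, so no `Loan_Grade_O` column is created). A common guard, sketched on toy frames, is to reindex the test dummies against the training columns:

```python
import pandas as pd

train = pd.DataFrame({"grade": ["A", "B", "C"]})
test = pd.DataFrame({"grade": ["A", "A"]})  # "B"/"C" never appear

X_train = pd.get_dummies(train, drop_first=True)
X_test = pd.get_dummies(test, drop_first=True)

# Force the test matrix onto the training columns; missing dummies become 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```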
02-13. Split the data into X (features) and y (label), which is the interest rate.
X, y = df.drop("Interest_Rate", axis=1), df["Interest_Rate"]
03-1. Splitting X, y into training and testing sets.
(X_train, X_test, y_train, y_test) = train_test_split(
X, y, test_size=0.2, random_state=1
)
03-2. Import necessary libraries
# Import DecisionTreeRegressor from sklearn.tree
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE
# from sklearn.model_selection import RandomizedSearchCV
rf = RandomForestRegressor(random_state=1)
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = ["auto", "sqrt"]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {
"n_estimators": n_estimators,
"max_features": max_features,
"max_depth": max_depth,
"min_samples_split": min_samples_split,
"min_samples_leaf": min_samples_leaf,
"bootstrap": bootstrap,
}
# Use the random grid to search for best hyperparameters
# First create the base model to tune
####rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation,
# search across 10 different combinations, and use all available cores
####rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 5, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
####rf_random.fit(X_train, y_train)
# rf_random.best_params_
03-3. Random Forest Regressor Model
# Instantiate a random forest regressor 'rf' with 400 estimators
SEED = 1
rf = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.13, random_state=SEED)
# Fit 'rf' to the training set
rf.fit(X_train, y_train)
RandomForestRegressor(min_samples_leaf=0.13, n_estimators=400, random_state=1)
# Compute Train / Test set RMSE
# Predict the labels of the training set
y_pred_train = rf.predict(X_train)
# Compute y_pred
y_pred_test = rf.predict(X_test)
# Compute the array containing the 10-fold CV MSEs
MSE_CV_scores = -cross_val_score(
    rf, X_train, y_train, cv=10, scoring="neg_mean_squared_error", n_jobs=-1
)
# Evaluate the training set RMSE of rf
RMSE_train = (MSE(y_train, y_pred_train)) ** (1 / 2)
# Evaluate the test set RMSE of rf
RMSE_test = (MSE(y_test, y_pred_test)) ** (1 / 2)
# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean()) ** (1 / 2)
# Print RMSE_train
print("Train RMSE: {:.2f}".format(RMSE_train))
# Print RMSE_test
print("Test RMSE of rf: {:.2f}".format(RMSE_test))
# Print RMSE_CV
print("CV RMSE: {:.2f}".format(RMSE_CV))
Train RMSE: 3.85
Test RMSE of rf: 3.87
CV RMSE: 3.85
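Step 2 of the prompt asks for a first part that produces and saves the model and a second part that loads and applies it; neither appears in the notebook yet. A sketch with `joblib` (toy data and a hypothetical filename standing in for the real ones):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Toy regression data standing in for (X_train, y_train)
rng = np.random.RandomState(1)
X_toy, y_toy = rng.rand(100, 3), rng.rand(100)

model = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_toy, y_toy)

# Part 1: produce and save the model
path = os.path.join(tempfile.mkdtemp(), "rf_model.joblib")
dump(model, path)

# Part 2: load and apply the model
reloaded = load(path)
preds = reloaded.predict(X_toy)
```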
# Instantiate dt, Decision Tree
dt = DecisionTreeRegressor(max_depth=10, min_samples_leaf=0.13, random_state=3)
# Fit dt to the training set
dt.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=10, min_samples_leaf=0.13, random_state=3)
# Compute Train / Test set RMSE
# Predict the labels of the training set
y_pred_train = dt.predict(X_train)
# Compute y_pred
y_pred_test = dt.predict(X_test)
# Compute the array containing the 10-fold CV MSEs
MSE_CV_scores = -cross_val_score(
    dt, X_train, y_train, cv=10, scoring="neg_mean_squared_error", n_jobs=-1
)
# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train)) ** (1 / 2)
# Evaluate the test set RMSE of dt
RMSE_test = (MSE(y_test, y_pred_test)) ** (1 / 2)
# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean()) ** (1 / 2)
# Print RMSE_train
print("Train RMSE: {:.2f}".format(RMSE_train))
# Print RMSE_test
print("Test RMSE of dt: {:.2f}".format(RMSE_test))
# Print RMSE_CV
print("CV RMSE: {:.2f}".format(RMSE_CV))
Train RMSE: 3.65
Test RMSE of dt: 3.67
CV RMSE: 3.65
# Instantiate dt2, a second Decision Tree with a smaller min_samples_leaf
dt2 = DecisionTreeRegressor(max_depth=10, min_samples_leaf=0.05, random_state=3)
# Fit dt to the training set
dt2.fit(X_train, y_train)
# Compute Train / Test set RMSE
# Predict the labels of the training set
y_pred_train = dt2.predict(X_train)
# Compute y_pred
y_pred_test = dt2.predict(X_test)
# Compute the array containing the 10-fold CV MSEs
MSE_CV_scores = -cross_val_score(
    dt2, X_train, y_train, cv=10, scoring="neg_mean_squared_error", n_jobs=-1
)
# Evaluate the training set RMSE of dt2
RMSE_train = (MSE(y_train, y_pred_train)) ** (1 / 2)
# Evaluate the test set RMSE of dt2
RMSE_test = (MSE(y_test, y_pred_test)) ** (1 / 2)
# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean()) ** (1 / 2)
# Print RMSE_train
print("Train RMSE: {:.2f}".format(RMSE_train))
# Print RMSE_test
print("Test RMSE of dt: {:.2f}".format(RMSE_test))
# Print RMSE_CV
print("CV RMSE: {:.2f}".format(RMSE_CV))
Train RMSE: 2.97
Test RMSE of dt: 3.00
CV RMSE: 2.98
importances_dt = pd.Series(dt.feature_importances_, index=X.columns)
# Sort importances_dt and keep the top 5
sorted_importances_dt = importances_dt.sort_values(ascending=False).head(5)
# Make a horizontal bar plot
sorted_importances_dt.plot(kind="barh", color="blue")
plt.show()
04-1. Read in the data with necessary columns.
dftest = pd.read_csv("Holdout for Testing.csv", usecols=columns_subset)
dftest = dftest.drop("X1", axis=1)
column_rename = {
# "X1": "Interest_Rate",
"X4": "Loan_Requested",
"X5": "Loan_Funded",
"X6": "Investor_portion",
"X7": "Number_of_Payments",
"X8": "Loan_Grade",
"X11": "Years_Employed",
"X12": "Home_Ownership",
"X13": "Annual_Income",
"X14": "Verification",
"X15": "Issued_MonYY",
"X17": "Loan_Category",
"X20": "State",
"X21": "Debit_Payment_Ratio",
"X22": "Delinquency_Frequency",
"X23": "Credit_Earliest_MonYY",
"X24": "Inquiries_Count",
"X25": "Delinquency_Recency",
"X26": "Public_Record_Recency",
"X27": "Open_Credit_Line_Count",
"X28": "Derogatory_Public_Record_Count",
"X29": "Total_Credit",
"X30": "Credit_Utilization_Rate",
"X31": "Total_Credit_Line_Count",
"X32": "Initial_Status",
}
dftest = dftest.rename(columns=column_rename)
04-2. Replicate the cleaning and preprocessing steps from the training data on the holdout test data
# Data type - From "String" To "Float"
## Rate - remove % and change to float type
dftest["Credit_Utilization_Rate"] = (
dftest["Credit_Utilization_Rate"].str.slice(stop=-1).astype("float")
)
## Dollar Amount - remove $ and "," and change to float type
for col in ["Loan_Requested", "Loan_Funded", "Investor_portion"]:
    dftest[col] = dftest[col].str.slice(start=1).str.replace(",", "").astype("float")
# Data type - To Category
## Number of Payments - only 2: 36 months/ 60 months
# Change to a 'Payments_is_36' column with 2 values: (1, 0)
# Note: strip the padding first - the raw values are " 36 months" / " 60 months",
# so comparing against " 36months" would mark every loan as 0
dftest["Payments_is_36"] = (
    dftest["Number_of_Payments"].str.strip() == "36 months"
).astype(int)
dftest = dftest.drop("Number_of_Payments", axis=1)
## Simply change to Category
Category = [
    "Loan_Grade",  # No regrouping needed here - the holdout has no missing grades
    "Years_Employed",
    "Home_Ownership",  # No regrouping needed here - no 'NONE', 'ANY', or NaN values
    "Verification",
    "Loan_Category",
    "State",
    "Initial_Status",
]
for col in Category:
    dftest[col] = dftest[col].astype("category")
dftest.isna().sum()
Loan_Requested                         0
Loan_Funded                            0
Investor_portion                       0
Loan_Grade                             0
Years_Employed                      4382
Home_Ownership                         0
Annual_Income                          0
Verification                           0
Issued_MonYY                           0
Loan_Category                          0
State                                  0
Debit_Payment_Ratio                    0
Delinquency_Frequency                  0
Credit_Earliest_MonYY                  0
Inquiries_Count                        0
Delinquency_Recency                38704
Public_Record_Recency              66161
Open_Credit_Line_Count                 0
Derogatory_Public_Record_Count         0
Total_Credit                           0
Credit_Utilization_Rate               30
Total_Credit_Line_Count                0
Initial_Status                         0
Payments_is_36                         0
dtype: int64
dftest.head(3)
| Loan_Requested | Loan_Funded | Investor_portion | Loan_Grade | Years_Employed | Home_Ownership | Annual_Income | Verification | Issued_MonYY | Loan_Category | State | Debit_Payment_Ratio | Delinquency_Frequency | Credit_Earliest_MonYY | Inquiries_Count | Delinquency_Recency | Public_Record_Recency | Open_Credit_Line_Count | Derogatory_Public_Record_Count | Total_Credit | Credit_Utilization_Rate | Total_Credit_Line_Count | Initial_Status | Payments_is_36 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6000.0 | 6000.0 | 6000.0 | C | 10+ years | MORTGAGE | 68000.0 | VERIFIED - income | 15-Mar | debt_consolidation | OH | 28.31 | 0 | 2-Nov | 1 | 26.0 | NaN | 18 | 0 | 19861 | 64.5 | 33 | f | 0 |
| 1 | 24000.0 | 24000.0 | 24000.0 | A | 8 years | RENT | 110480.0 | VERIFIED - income | 15-Mar | debt_consolidation | CA | 16.03 | 0 | Dec-68 | 1 | NaN | NaN | 12 | 0 | 17001 | 26.2 | 36 | w | 0 |
| 2 | 35000.0 | 35000.0 | 35000.0 | C | 10+ years | MORTGAGE | 86000.0 | VERIFIED - income | 15-Mar | debt_consolidation | PA | 32.49 | 0 | Oct-98 | 0 | NaN | NaN | 16 | 0 | 25797 | 49.9 | 33 | w | 0 |
# dftest["Issued_MonYY"].unique()
dftest = dftest.sort_values("Credit_Earliest_MonYY").reset_index()
# dftest.head()
# 'Credit_Earliest_MonYY' date formats are inconsistent; the column needs to be reformatted and converted to datetime objects
# Explore: dftest["Credit_Earliest_MonYY"][34000:34100]
# After reviewing, rows 0-34016 are all in "%d-%b" format (this year), and the rest are all in "%b-%y"
from datetime import datetime
# Use label-based .loc assignment (not chained indexing) so the writes hit the
# DataFrame directly instead of a possible copy
dftest.loc[:34016, "Credit_Earliest_MonYY"] = dftest.loc[:34016, "Credit_Earliest_MonYY"].apply(
    lambda x: datetime.strptime(str(x) + "-22", "%d-%b-%y")
)
dftest.loc[34017:, "Credit_Earliest_MonYY"] = dftest.loc[34017:, "Credit_Earliest_MonYY"].apply(
    lambda x: datetime.strptime(str(x), "%b-%y")
)
# 'Issued_MonYY' change to datetime object
dftest["Issued_MonYY"] = dftest["Issued_MonYY"].apply(
lambda x: datetime.strptime(str(x + "-22"), "%d-%b-%y")
)
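Slice-then-assign (`df["col"][:n] = ...`) can trigger pandas' `SettingWithCopyWarning` because the write may land on a temporary copy. A minimal sketch with toy, hypothetical values of the label-based `.loc` pattern pandas recommends:

```python
import pandas as pd

# Toy frame standing in for the mixed-format date column
df = pd.DataFrame({"raw": ["2-Nov", "Dec-68", "Oct-98"]})

# .loc[:1, "raw"] selects rows 0-1 by label (inclusive) and writes to the
# frame directly, so no SettingWithCopyWarning is raised
df.loc[:1, "raw"] = df.loc[:1, "raw"] + "-22"

print(df["raw"].tolist())  # ['2-Nov-22', 'Dec-68-22', 'Oct-98']
```

Note that `.loc` slices are inclusive of the end label, unlike positional slicing, which is why `[:34016]` here covers the same rows as `[:34017]` did positionally.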
# 'Issued_MonYY', 'Credit_Earliest_MonYY'
# Convert both columns to Unix timestamps and subtract from today's timestamp
# to get the elapsed time since each record as a numeric feature
now_timestamp = datetime.timestamp(datetime.now())
dftest["Issued_MonYY"] = dftest["Issued_MonYY"].apply(lambda x: datetime.timestamp(x))
dftest["Since_Issued"] = now_timestamp - dftest["Issued_MonYY"]
dftest["Credit_Earliest_MonYY"] = dftest["Credit_Earliest_MonYY"].apply(
lambda x: datetime.timestamp(x)
)
dftest["Since_First_Crdit"] = now_timestamp - dftest["Credit_Earliest_MonYY"]
dftest = dftest.drop(["Issued_MonYY", "Credit_Earliest_MonYY"], axis=1)
dftest = dftest.sort_values("index")
dftest = dftest.drop("index", axis=1)
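The elapsed-time transform above can be sketched in isolation; the dates below are hypothetical, not rows from the actual data:

```python
from datetime import datetime

# Each date becomes "seconds elapsed since that date", so older records
# map to larger numeric values
now_ts = datetime.now().timestamp()
since_issued = now_ts - datetime(2015, 3, 15).timestamp()
since_first_credit = now_ts - datetime(1998, 10, 1).timestamp()

# An older first-credit date yields a larger elapsed value
print(since_first_credit > since_issued)  # True
```

Using seconds-since-epoch keeps both features on the same scale and monotonic in age, which tree-based models can split on directly.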
# Delinquency_Recency (NaN) 38704
dftest["Delinquency_Recency"] = dftest.Delinquency_Recency.fillna(999)
# Public_Record_Recency (NaN) 66161
dftest["Public_Record_Recency"] = dftest.Public_Record_Recency.fillna(999)
# Convert dftest["Years_Employed"] from str to float ("< 1" -> 0, "10+" -> 10),
# leaving NaN untouched so it can be interpolated below
dftest["Years_Employed"] = dftest["Years_Employed"].apply(
    lambda x: x if pd.isna(x) else x.replace("< 1", "0").replace("+", "").strip()[:2]
)
dftest["Years_Employed"] = dftest["Years_Employed"].astype("float")
# Fill NaN values using an interpolation method
dftest["Years_Employed"] = dftest["Years_Employed"].interpolate()
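`Series.interpolate` fills each gap linearly from its numeric neighbours; a small sketch with made-up values mimicking the cleaned `Years_Employed` column:

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps
s = pd.Series([2.0, np.nan, 4.0, np.nan, np.nan, 10.0])

# Default (linear) interpolation: each NaN becomes an evenly spaced value
# between its nearest non-missing neighbours
filled = s.interpolate()
print(filled.tolist())  # [2.0, 3.0, 4.0, 6.0, 8.0, 10.0]
```

Because the frame is sorted by `Credit_Earliest_MonYY` at this point, interpolation borrows values from rows with similar credit history length, which is one defensible (if rough) imputation choice.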
dftest["Region"] = dftest["State"].map(states)
dftest["Region"] = dftest["Region"].astype("category")
dftest = dftest.drop("State", axis=1)
# Credit_Utilization_Rate (NaN) 30
dftest["Credit_Utilization_Rate"] = dftest["Credit_Utilization_Rate"].interpolate()
dftest.isna().sum()
Loan_Requested                    0
Loan_Funded                       0
Investor_portion                  0
Loan_Grade                        0
Years_Employed                    0
Home_Ownership                    0
Annual_Income                     0
Verification                      0
Loan_Category                     0
Debit_Payment_Ratio               0
Delinquency_Frequency             0
Inquiries_Count                   0
Delinquency_Recency               0
Public_Record_Recency             0
Open_Credit_Line_Count            0
Derogatory_Public_Record_Count    0
Total_Credit                      0
Credit_Utilization_Rate           0
Total_Credit_Line_Count           0
Initial_Status                    0
Payments_is_36                    0
Since_Issued                      0
Since_First_Crdit                 0
Region                            0
dtype: int64
04-3. Create dummy variables for categorical variables.
dftest = pd.get_dummies(dftest, drop_first=True)
dftest.head()
| Loan_Requested | Loan_Funded | Investor_portion | Years_Employed | Annual_Income | Debit_Payment_Ratio | Delinquency_Frequency | Inquiries_Count | Delinquency_Recency | Public_Record_Recency | Open_Credit_Line_Count | Derogatory_Public_Record_Count | Total_Credit | Credit_Utilization_Rate | Total_Credit_Line_Count | Payments_is_36 | Since_Issued | Since_First_Crdit | Loan_Grade_B | Loan_Grade_C | Loan_Grade_D | Loan_Grade_E | Loan_Grade_F | Loan_Grade_G | Home_Ownership_OWN | Home_Ownership_RENT | Verification_VERIFIED - income source | Verification_not verified | Loan_Category_credit_card | Loan_Category_debt_consolidation | Loan_Category_home_improvement | Loan_Category_house | Loan_Category_major_purchase | Loan_Category_medical | Loan_Category_moving | Loan_Category_other | Loan_Category_renewable_energy | Loan_Category_small_business | Loan_Category_vacation | Loan_Category_wedding | Initial_Status_w | Region_Northeast | Region_Other | Region_South | Region_West | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11980 | 6000.0 | 6000.0 | 6000.0 | 10.0 | 68000.0 | 28.31 | 0 | 1 | 26.0 | 999.0 | 18 | 0 | 19861 | 64.5 | 33 | 0 | 2.897033e+07 | 8.925535e+06 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 42325 | 24000.0 | 24000.0 | 24000.0 | 8.0 | 110480.0 | 16.03 | 0 | 1 | 999.0 | 999.0 | 12 | 0 | 17001 | 26.2 | 36 | 0 | 2.897033e+07 | -1.445276e+09 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 74927 | 35000.0 | 35000.0 | 35000.0 | 10.0 | 86000.0 | 32.49 | 0 | 0 | 999.0 | 999.0 | 16 | 0 | 25797 | 49.9 | 33 | 0 | 2.897033e+07 | 7.690727e+08 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 48469 | 10000.0 | 10000.0 | 10000.0 | 10.0 | 30000.0 | 32.96 | 0 | 1 | 999.0 | 114.0 | 13 | 1 | 9586 | 43.8 | 21 | 0 | 2.897033e+07 | 7.584419e+08 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 8832 | 24000.0 | 24000.0 | 24000.0 | 10.0 | 82500.0 | 31.03 | 0 | 0 | 48.0 | 999.0 | 27 | 0 | 31842 | 41.3 | 43 | 0 | 2.897033e+07 | 6.329935e+06 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
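`pd.get_dummies(..., drop_first=True)` turns each categorical column with k levels into k-1 indicator columns; the dropped level is represented by all-zeros, avoiding collinearity. A tiny illustration with hypothetical data:

```python
import pandas as pd

# Two toy categorical columns standing in for Loan_Grade / Initial_Status
df = pd.DataFrame({"Grade": ["A", "B", "C"], "Status": ["f", "w", "w"]})

# drop_first=True drops the first level of each column ("A" and "f" here)
dummies = pd.get_dummies(df, drop_first=True)
print(sorted(dummies.columns))  # ['Grade_B', 'Grade_C', 'Status_w']
```

This is why `Loan_Grade_A`, `Home_Ownership_MORTGAGE`, etc. do not appear among the columns above: they are the implicit baseline levels.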
# Check if there are 48 features (same as X)
dftest.shape
(80000, 45)
# X.columns
# dftest.columns
# dftest is short 3 columns relative to X: 'Loan_Grade_O', 'Home_Ownership_OTHER', 'Loan_Category_educational'
missing = ["Loan_Grade_O", "Home_Ownership_OTHER", "Loan_Category_educational"]
for col in missing:
    # These levels never occur in the test set, so their dummy columns are all zeros
    dftest[col] = np.zeros(len(dftest), dtype="uint8")
# sklearn FutureWarning: the feature names should match those passed during fit,
# in the same order; starting with version 1.2 a mismatch will raise an error.
# Reorder dftest's columns to match X.columns:
cols = [
"Loan_Requested",
"Loan_Funded",
"Investor_portion",
"Years_Employed",
"Annual_Income",
"Debit_Payment_Ratio",
"Delinquency_Frequency",
"Inquiries_Count",
"Delinquency_Recency",
"Public_Record_Recency",
"Open_Credit_Line_Count",
"Derogatory_Public_Record_Count",
"Total_Credit",
"Credit_Utilization_Rate",
"Total_Credit_Line_Count",
"Payments_is_36",
"Since_Issued",
"Since_First_Crdit",
"Loan_Grade_B",
"Loan_Grade_C",
"Loan_Grade_D",
"Loan_Grade_E",
"Loan_Grade_F",
"Loan_Grade_G",
"Loan_Grade_O",
"Home_Ownership_OTHER",
"Home_Ownership_OWN",
"Home_Ownership_RENT",
"Verification_VERIFIED - income source",
"Verification_not verified",
"Loan_Category_credit_card",
"Loan_Category_debt_consolidation",
"Loan_Category_educational",
"Loan_Category_home_improvement",
"Loan_Category_house",
"Loan_Category_major_purchase",
"Loan_Category_medical",
"Loan_Category_moving",
"Loan_Category_other",
"Loan_Category_renewable_energy",
"Loan_Category_small_business",
"Loan_Category_vacation",
"Loan_Category_wedding",
"Initial_Status_w",
"Region_Northeast",
"Region_Other",
"Region_South",
"Region_West",
]
dftest = dftest[cols]
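The add-missing-columns-then-reorder steps above can also be done in one call with `DataFrame.reindex`; a hedged sketch, where `train_cols` stands in for `X.columns` and the data is hypothetical:

```python
import pandas as pd

# Training feature order (stand-in for X.columns)
train_cols = ["a", "b", "c"]

# Test frame missing column "b" and with columns out of order
test_df = pd.DataFrame({"c": [1, 0], "a": [0, 1]})

# reindex adds any absent columns filled with 0 and enforces the training order
aligned = test_df.reindex(columns=train_cols, fill_value=0)
print(aligned.columns.tolist())  # ['a', 'b', 'c']
print(aligned["b"].tolist())     # [0, 0]
```

This is more robust than a hand-maintained column list, since it cannot silently drift out of sync with the training matrix.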
dftest_pred_dt2 = dt2.predict(dftest)
test = pd.read_csv("Holdout for Testing.csv")
predict = pd.DataFrame(
{"Loan_Id": test["X2"], "Predicted_Interest_Rate": dftest_pred_dt2}
)
# Save as "Results from" *insert your name or UChicago net ID*
predict.to_csv("Loan_Results_from_Kelsey-Liu.csv", index=False)
Between the two machine learning models, Random Forest and Decision Tree, the Decision Tree achieves a slightly better result with the parameters we originally set (max_depth=10, min_samples_leaf=0.13): its RMSE on the train, test, and CV sets is about 0.2 lower than the Random Forest's. Furthermore, after tuning the parameters, the RMSE is about 0.7 lower than the baseline model's. The baseline model underfits the data, which is why its CV error (3.65) is nearly identical to its training error (3.65) yet still higher than the desired error. To remedy underfitting we can either increase max_depth or decrease min_samples_leaf; here I decreased min_samples_leaf from 0.13 to 0.05. Since there is little room left for further parameter tuning, additional gains would have to come from the data itself: including some of the columns excluded at the beginning (e.g., text mining on "Reason for loan"), dedicating more effort to feature engineering to turn them into useful signals, or gathering more relevant features.
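For reference, the RMSE metric used to score the models can be sketched as follows; the rate values below are hypothetical, not drawn from the actual holdout set:

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error: square root of the mean squared residual
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy example: predictions off by 1, 0, and 1 percentage point
print(round(rmse([10.0, 12.0, 14.0], [11.0, 12.0, 13.0]), 4))  # 0.8165
```

Because errors are squared before averaging, RMSE penalizes large misses more heavily than mean absolute error, which matters when a few badly mispriced loans are costlier than many small errors.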